Benchmark Datasets for Graph Layout Algorithms

Authors and Affiliations

Sara Di Bartolomeo, University of Konstanz
Tarik Crnovrsanin, Northeastern University
Connor Wilson, Northeastern University
Eduardo Puerta, Northeastern University
Alexander Frings, University of Konstanz
Cody Dunne, Northeastern University

Abstract

Introduction

Computational evaluations are crucial for standardized and objective evaluation of graph layout algorithms. Standard benchmark datasets facilitate comparison with prior work, and reliable access to datasets is fundamental for replicability. However, there is no comprehensive repository of benchmark datasets for Graph Drawing, and many datasets have been lost.

Data collection

We collected 196 papers from Graph Drawing, IEEE venues, and Eurographics venues that include computational evaluations of graph layout algorithms. We searched for the datasets used.

Data analysis

We archived datasets we found and re-created ones that were lost but had sufficient replication instructions. We classified graphs by their features and statistics. We also found text and images from papers using those graphs.

Implementation

We implemented graph creation, data analysis, and website code.

Materials

We provide a Graph Benchmark Datasets website; a repository with code for the website and metadata on the benchmark datasets; and a long-term archive of this repository and the benchmark datasets.

Conclusion

We provide a resource for the Graph Drawing and visualization communities to use to find datasets for computational evaluations of graph layout algorithms. Our organization by features and statistics supports rapid identification of relevant graphs. We have re-created and archived graphs used in research for their long-term preservation.

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


1 Introduction

Benchmarking is a crucial aspect of computer science, as it allows researchers, developers, and engineers to compare the performance of different systems, algorithms, or hardware. A benchmark is a standardized test, or set of tests, used to measure and compare performance under specific conditions, with the aim of providing objective and consistent metrics that allow for fair comparisons and informed decision-making. Benchmarks are widely used in fields such as hardware evaluation, software optimization, and system performance analysis. In all these fields, benchmarking provides a standardized and objective way to assess different systems, algorithms, or software implementations, and it aids in deciding which solution best suits a specific use case or requirement.

The same is true for graph drawing, particularly for studying the performance and results of graph layout algorithms (Di Bartolomeo et al. 2024). Benchmark datasets provide a standardized set of graphs with known properties and characteristics, varying in size, density, connectivity, and structure. By applying different graph layout algorithms to the same benchmark dataset, researchers can objectively compare the algorithms' performance and the quality of their results.

In our own work, we have faced challenges in determining which benchmark datasets to use for evaluating the layout algorithms we developed. This led us to build a collection of benchmark datasets used in previous graph layout algorithm papers and a Graph Benchmark Datasets website for perusing the collection. We collected 196 papers from Graph Drawing, IEEE venues, and Eurographics venues that include computational evaluations of graph layout algorithms. We then searched for the datasets used for the benchmarks. We archived the data we could find and had permission to store, and we re-created datasets that were lost but had sufficient replication instructions. We classified graphs by their features and statistics. We also collected text and images from papers using those graphs.

This paper aims to present this graph drawing benchmark sets resource to the Graph Drawing and visualization communities so that other authors may benefit from our archiving and organization efforts. We hope this resource will encourage the discoverability of these datasets and the ease of running benchmarks for graph layout algorithms. Moreover, as reliable access to datasets is fundamental for replicability, we aim to preserve these datasets in perpetuity. Beyond collecting available datasets and re-creating lost ones, we archived all our materials on OSF for long-term availability. This included saving each graph in multiple common file formats to avoid translation issues for individual authors. We believe our work will lead to more reproducible and replicable Graph Drawing research by providing a long-term and open archive of the data we use in our computational evaluations.

Specifically, we contribute:

  • A collection of benchmark datasets used for graph layout algorithm evaluations, including:
    • Re-creating lost datasets based on paper descriptions and a list of unavailable data,
    • A classification of these datasets by graph features and statistics to aid in identifying appropriate evaluation candidates,
    • Exemplar text and images from papers that use these datasets,
    • A website for perusing this collection, and
    • A long-term archive of our metadata and the collection to aid in reproducibility and replicability of evaluations.

Please see Section 5 below for our supplemental materials, including links to the website, code and data, OSF archive, and Notion database.

2 Collection process

The information we collected is a by-product of a larger systematic review we conducted on graph layout algorithms, which included 196 papers. The following figure shows the original data collection process (from Di Bartolomeo et al. 2024):


The sources used for the collection of papers. We started from the last seven years of Graph Drawing proceedings, integrated the results of searches on the IEEE and Eurographics libraries, and then also included the most cited papers in Graph Drawing. The final result was 196 papers containing layout algorithm evaluations.

The core of our data collection was the last seven years of Graph Drawing proceedings (264 papers in total), filtering out papers without computational evaluations. We further expanded our collection of graph layout algorithm papers by searching the IEEE Xplore and Wiley digital libraries to include papers from TVCG and CGF. Then, we checked all the citations in the papers we collected from Graph Drawing and added all the papers that were cited more than 4 times in the last 6 years of Graph Drawing, to make sure we included algorithm papers that were important but not published at GD, on IEEE Xplore, or on the Wiley digital library. For each paper, we collected which features were handled by the graph layout algorithm presented and which dataset was used in the evaluation. When collecting features, we always prioritized the authors’ own wording and description of the features. The tagging of the papers was done by two people at the same time, over two different passes, for sanity-checking purposes.

Following this process, we tried to track down the datasets used in computational evaluations: (1) we first looked for official or linked supplemental material, (2) we next Googled the dataset or paper name, and (3) finally, we emailed the authors. When in doubt about the artifact replication policy, we asked the owners or authors by email. In cases where it was explicitly mentioned that approval should be received before redistribution, we did not redistribute the datasets. However, if we received approval, or did not receive an answer and found no explicit policy preventing redistribution, we collected and stored the dataset to preserve it for future researchers.

If any dataset owner or author discovers their own work in our collection and would like it removed, we kindly request that they contact us (see Section 7 below), and we will promptly remove it. Furthermore, we want to emphasize that we do not assert any ownership rights over the datasets listed.

The chart below shows the distribution of papers across different venues:

The next chart shows the distribution of the collected papers’ publication dates:

After collecting the datasets, we looked more deeply into their contents, computing a number of statistics for the graphs contained in them.
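As an illustration of the per-graph statistics involved (node count, edge count, mean degree, and maximum degree), here is a minimal sketch in plain Python; it assumes simple, undirected graphs given as edge lists over integer node ids, which is our own simplification rather than the repository's actual analysis code.

```python
# Sketch of the per-graph statistics computed for each collection:
# node count, edge count, mean degree, and maximum degree. Graphs are
# assumed simple and undirected, given as an edge list over integer
# node ids 0..num_nodes-1 (an assumption for this example).

def graph_stats(num_nodes, edges):
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {
        "nodes": num_nodes,
        "edges": len(edges),
        "mean_degree": sum(degree) / num_nodes if num_nodes else 0.0,
        "max_degree": max(degree, default=0),
    }

# A 4-node path has 3 edges, mean degree 1.5, and maximum degree 2.
stats = graph_stats(4, [(0, 1), (1, 2), (2, 3)])
```

These four quantities are exactly the ones binned in the histograms shown for each dataset below.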

We finally ended up collecting 46 datasets, listed below:

The purpose of our collection effort is to facilitate comparisons and replication of results, and to provide a resource for researchers to find datasets that are used in the literature. We also provide links to the specific papers that use each dataset, so that they can be used as examples and sources of information.

3 Datasets in use

The following chart shows how many times we found a dataset being used in the papers we collected. It excludes custom edits to the datasets, which are discussed later in this document.

In the data we collected, the most used dataset is Rome-Lib, followed by assorted collaboration networks (which in many cases refers to datasets of academic collaborations such as dblp or vispubdata). The third most used dataset is from C. Walshaw. It is important to note that the Walshaw dataset is available as part of other collections; for instance, its graphs are found in SNAP. However, during the collection process, we gave precedence to how the authors reported their own information: if the authors stated the data was from the Walshaw collection, we reported it as such.

The following sections describe the classifications and the datasets in detail. Each dataset has a dedicated, collapsible section. When a dataset contains multiple graphs, its section includes a sparkline visualization illustrating the distribution of graph sizes (in nodes) across the dataset.

3.1 Classification of the Datasets

We classified the datasets into different categories, based on their origins or on the number of graphs they contain:

Uniform Benchmark datasets

Uniform Benchmark datasets are standalone, widely used collections of graphs with uniform characteristics: usually simple, generic graphs, often used in evaluations that run over thousands of graphs to report average completion times, or in other experiments where the reported metrics are aggregated.
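Evaluations of this kind typically loop over every graph in a collection and report only aggregated timings. A minimal harness sketch follows; `benchmark` and `layout` are our own stand-in names, not part of any collection's tooling, and `layout` represents whatever algorithm is under test.

```python
import statistics
import time

def benchmark(graphs, layout):
    """Run `layout` on every graph and report only aggregated timings,
    as in evaluations that average over thousands of benchmark graphs."""
    times = []
    for g in graphs:
        start = time.perf_counter()
        layout(g)
        times.append(time.perf_counter() - start)
    return {
        "graphs": len(times),
        "mean_s": statistics.mean(times),
        "median_s": statistics.median(times),
    }

# Usage with a no-op stand-in layout over three dummy "graphs":
report = benchmark([object(), object(), object()], lambda g: None)
```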

The first of these collapsible sections is shown already expanded, to give an example of the content that can be found in each of them. The content is generated dynamically based on the data we collected.

Rome-Lib

Rome-Lib is, as previously discussed, the most common benchmark dataset, owing to its established popularity, its ease of use and access, and the many properties already known about it. It was first introduced by (Battista et al. 1997) in the paper “An experimental comparison of four graph drawing algorithms”, and it mostly contains “real” networks derived from software companies, government organizations, software engineering books, database design, and journal articles on visualization. It was originally sent as an extended abstract to Computational Geometry in 1995 (Di Battista, Garg, and Liotta 1995). It contains exclusively generic graphs (e.g., undirected, non-layered, without pre-established clusters), although some of the research that uses it enhances the graphs with additional attributes (such as performing a rank assignment step). Information about optimal crossings can be found as part of WebCompute.
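Each dataset below can be downloaded in GEXF, GML, GraphML, and JSON. As an illustration (our own sketch, not the repository's tooling), a GraphML file can be read with Python's standard library alone; the tiny embedded graph here is hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical three-node GraphML document, inlined for the example.
GRAPHML = """<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="n0"/><node id="n1"/><node id="n2"/>
    <edge source="n0" target="n1"/>
    <edge source="n1" target="n2"/>
  </graph>
</graphml>"""

NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

def read_graphml(text):
    """Return (node ids, edge pairs) from a GraphML document."""
    root = ET.fromstring(text)
    nodes = [n.get("id") for n in root.findall(".//g:node", NS)]
    edges = [(e.get("source"), e.get("target"))
             for e in root.findall(".//g:edge", NS)]
    return nodes, edges

nodes, edges = read_graphml(GRAPHML)
```

In practice a graph library (e.g., one with a GraphML reader) would be used instead; the point is only that the provided formats are standard and easy to parse.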


Download formatted data: GEXF GML GRAPHML JSON

Size: 11534 graphs, 10 to 100 nodes, 9 to 158 edges

Origin paper:
An experimental comparison of four graph drawing algorithms
An experimental comparison of three graph drawing algorithms (extended abstract)

Usage examples:
A branch-and-cut approach to the crossing number problem
generic
Layer-Free Upward Crossing Minimization
directed edges
layered graphs
n-layers
Experiments on Exact Crossing Minimization Using Column Generation
generic
Stratisfimal Layout: A modular optimization model for laying out layered node-link network visualizations
clusters (pre-existing)
compound graphs
layered graphs
n-layers
Aesthetic Discrimination of Graph Layouts
generic
A New Approach to Exact Crossing Minimization
generic
Improved circular layouts
circular
Deep Neural Network for DrawiNg Networks
An Experimental Study of Crossing Minimization Heuristics
generic
An experimental comparison of four graph drawing algorithms
generic
A Note on the Practicality of Maximal Planar Subgraph Algorithms
planar
Experimental Analysis of the Accessibility of Drawings with Few Segments
sparse
trees
An Interactive Tool to Explore and Improve the Ply Number of Drawings
generic
A Greedy Heuristic for Crossing-Angle Maximization
generic
A Heuristic Approach Towards Drawings of Graphs with High Crossing Resolution
generic
An Integer-Linear Program for Bend-Minimization in Ortho-Radial Drawings
ortho
ortho-radial
Star-Struck by Fixed Embeddings: Modern Crossing Number Heuristics
generic
An effective crossing minimisation heuristic based on star insertion
generic
Advances in the Planarization Method: Effective Multiple Edge Insertions
generic
An ILP-based Proof System for the Crossing Number Problem
generic


Statistics

[Histograms of node count, edge count, mean degree, and maximum degree (binned) for the graphs in this dataset.]

Descriptions from Literature

From “A branch-and-cut approach to the crossing number problem”:

To test the performance of our new algorithm, we used a benchmark set of graphs of the University of Rome III, introduced in [11]. The set contains 11 389 graphs that consist of 10 to 100 nodes and 9 to 158 edges. These graphs were generated from a core set of 112 “real life” graphs used in database design and software engineering applications. Most of the graphs are sparse, which is a common property in most application areas of automatic graph drawing. The average ratio between the number of edges and the number of nodes of the graphs from the benchmark set is about 1.35.

From “Layer-free upward crossing minimization”:

The Rome graphs [Di Battista et al. 1997] are a widely used benchmark set in graph drawing, obtained from a basic set of 112 real-world graphs. It contains 11,528 instances with 10 through 100 nodes and 9 through 158 edges. Although the graphs are originally undirected, they have been used as directed graphs—by artificially directing the edges according to the node order given in the input files—for showing the performance of the mixed-upward planarization approach [Eiglsperger et al. 2003]. In this case, all edges are directed and the graphs are acyclic; hence, the mixed-upward planarization approach turns into an upward planarization method.

From “A New Approach to Exact Crossing Minimization”:

We say graphs are trivial, if they are planar or if the heuristic achieves a planarization with only one crossing, as in these cases we need not prove optimality. The Rome library contains 7172 non-trivial graphs.

From “An Experimental Comparison of Four Graph Drawing Algorithms”:

Our test graph generation strategy is as follows. First, we have focused on the important application area of database and software visualization, where Entity-Relationship diagrams and Data-Flow diagrams are usually displayed with orthogonal drawings. Second, we have collected 112 “real life” graphs with number of vertices between 10 and 100, from now on called core graphs, from the following sources:

• 54% of the graphs have been obtained from major Italian software companies (especially from Database Informatica) and large government organization (including the Italian Internal Revenue Service and the Italian National Advisory Council for Computer Applications in the Government (Autorita’ per l’Informatica nella Pubblica Amministrazione));
• 33% of the graphs were taken from well-known reference books in software engineering [18] and database design [1], and from journal articles on software visualization in the recent issues of Information Systems and the IEEE Transactions on Software Engineering;
• 13% of the graphs were extracted from theses in software and database visualization written by students at the University of Rome “La Sapienza”.

Our approach is based on the following scheme. We defined several primitive operations for updating graphs, which correspond to the typical operations performed by designers of Entity-Relationship and Data-Flow Diagrams, and attributed a certain probability to each of them. More specifically, the updating primitives we have used are the following: InsertEdge, which inserts a new edge between two existing vertices; DeleteEdge, which deletes an existing edge; InsertVertex, which splits an existing edge into two edges by inserting a new vertex; DeleteVertex, which deletes a vertex and all its incident edges; and MakeVertex, which creates a new vertex and connects it to a subset of vertices.

The test graphs were then generated in several iterations starting from the core graphs by applying random sequences of operations with a “genetic” mechanism. Namely, at each iteration a new set of test graphs was obtained by applying a random sequence of operations to the current test set. Each new graph was then evaluated for “suitability”, and those found not suitable were discarded. The probability of each primitive operation was varied at the end of each iteration.

The evaluation of the suitability of the generated graphs was conducted using both objective and subjective analyses. The objective analysis consisted of determining whether the new graph had similar structural properties with respect to the core graph it was derived from. We have taken into account parameters like the average ratio between number of vertices and number of edges and the average number of biconnected components. The subjective analysis consisted in a visual inspection of the new graph and an assessment by expert users of Entity-Relationship and Data-Flow diagrams of its similarity to a “real-life” diagram. For obvious reasons, the subjective analysis has been done on a randomly selected subset of the graphs.
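The five updating primitives quoted above can be sketched in code. This is our own minimal reconstruction, assuming simple undirected graphs stored as a node set and an edge set; the selection probabilities and all implementation details are assumptions, not the original generator.

```python
import random

# Minimal reconstruction of the five Rome-graph updating primitives.
# Graphs are undirected: a set of integer nodes plus a set of sorted
# edge tuples. Details beyond the primitive names are assumptions.

def _edge(u, v):
    return (u, v) if u < v else (v, u)

def insert_edge(nodes, edges, rng):
    if len(nodes) >= 2:  # InsertEdge: connect two existing vertices
        u, v = rng.sample(sorted(nodes), 2)
        edges.add(_edge(u, v))

def delete_edge(nodes, edges, rng):
    if edges:  # DeleteEdge: remove an existing edge
        edges.remove(rng.choice(sorted(edges)))

def insert_vertex(nodes, edges, rng):
    if edges:  # InsertVertex: split an edge (u, v) into (u, w), (w, v)
        u, v = rng.choice(sorted(edges))
        w = max(nodes) + 1
        nodes.add(w)
        edges.remove(_edge(u, v))
        edges.update({_edge(u, w), _edge(w, v)})

def delete_vertex(nodes, edges, rng):
    if len(nodes) > 1:  # DeleteVertex: drop a vertex and incident edges
        v = rng.choice(sorted(nodes))
        nodes.remove(v)
        edges.difference_update({e for e in edges if v in e})

def make_vertex(nodes, edges, rng):
    # MakeVertex: new vertex connected to a random subset of vertices.
    targets = [v for v in sorted(nodes) if rng.random() < 0.3]
    targets = targets or [rng.choice(sorted(nodes))]
    w = max(nodes) + 1
    nodes.add(w)
    edges.update(_edge(w, v) for v in targets)

PRIMITIVES = [insert_edge, delete_edge, insert_vertex, delete_vertex, make_vertex]

def mutate(nodes, edges, steps, seed=0):
    """Apply a random sequence of primitives, as in one 'genetic' pass."""
    rng = random.Random(seed)
    for _ in range(steps):
        rng.choice(PRIMITIVES)(nodes, edges, rng)
    return nodes, edges

# Grow a small "core" path graph by 20 random operations.
nodes, edges = mutate({0, 1, 2, 3}, {(0, 1), (1, 2), (2, 3)}, steps=20, seed=1)
```

The paper's suitability filtering (structural similarity to the core graph, plus visual inspection) is not reproduced here.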

Example Figures

From ”An Experimental Comparison of Four Graph Drawing Algorithms”:

From “Experimental Analysis of the Accessibility of Drawings with Few Segments”:

From “Deep Neural Network for DrawiNg Networks ((DNN)^2)”:

Fig 5. Layout examples for (DNN)^2, (DNN)^2, tsNET∗ and S_GD^2.

The following collections, together with Rome-Lib, can be easily accessed from the homepage of the Graph Drawing Conference website, and are therefore well known and widely used.

AT&T

Originally collected by Stephen North at AT&T Bell Labs (see the descriptions from literature section below for more information). The original link from 1995 is broken: ftp://ftp.research.att.com/dist/drawdag. Di Battista et al. modified the dataset by removing isomorphic graphs, connecting disconnected graphs, and removing cycles. This is the same data as the North DAG collection. It contains directed, acyclic graphs.


Download formatted data: GEXF GML GRAPHML JSON

Size: 1277 graphs, 10 to 100 nodes, 9 to 241 edges

Origin paper:
(Battista et al. 2000)    Drawing Directed Acyclic Graphs: An Experimental Study

Usage examples:
Drawing Large Graphs with a Potential-Field-Based Multilevel Algorithm
generic
large
An Experimental Comparison of Fast Algorithms for Drawing General Large Graphs
generic
large
Large-Graph Layout with the Fast Multipole Multilevel Method
generic
large
Compact Layered Drawings of General Directed Graphs
directed edges
layered graphs
n-layers
A Flow Formulation for Horizontal Coordinate Assignment with Prescribed Width
dag
hierarchical
layered graphs
n-layers
A Natural Quadratic Approach to the Generalized Graph Layering Problem
layered graphs
n-layers
Advances in the Planarization Method: Effective Multiple Edge Insertions
generic
Simple and Efficient Bilayer Cross Counting
bipartite
layered graphs
Drawing Directed Acyclic Graphs: An Experimental Study
directed edges
layered graphs
n-layers
Measuring Symmetry in Drawings of Graphs
generic


Statistics

[Histograms of node count, edge count, mean degree, and maximum degree (binned) for the graphs in this dataset.]

Descriptions from Literature

From “Drawing Directed Acyclic Graphs: An Experimental Study”:

The experimental study was performed on two different sets of DAGs, both with a strong connection to “real-life” applications. We considered two typical contexts where DAGs play a fundamental role, namely software engineering and project planning.

The first set of test DAGs are what we call the North DAGs. They are obtained from a collection of directed graphs [28], that North collected at AT&T Bell Labs by running for two years Draw DAG, an e-mail graph drawing service that accepts directed graphs formatted as e-mail messages and returns messages with the corresponding drawings [27].

Originally, the North DAGs consisted of 5114 directed graphs, whose number of vertices varied in the range 1 … 7602. However, the density of the directed graphs with a number of vertices that did not fall in the range 10 … 100 was very low (see also the statistics in [28]); since such directed graphs represent a very sparse statistical population we decided to discard them. Then we noted that many directed graphs were isomorphic; since the vertices of the directed graphs have labels associated with them, the problem is tractable. For each isomorphism class, we kept only one representative directed graph. Also, we deleted the directed graphs where subgraphs were specified as clusters, to be drawn in their own distinct rectangular region of the layout, because constrained algorithms are beyond the scope of this study. This filtering left us with 1277 directed graphs.

Still, 491 directed graphs were not connected and this was a problem for running algorithms implemented in GDW (they assume input directed graphs to be connected). Instead of discarding the directed graphs, we followed a more practical approach, by randomly adding a minimum set of directed edges that makes each directed graph connected. Finally, we made the directed graph acyclic, where necessary, by applying some heuristics for inverting the direction of a small subset of edges.

We then ran a first set of experiments and produced the statistics by grouping the DAGs by number of vertices. Although the comparison among the algorithms looked consistent (the produced plots never oddly overlapped), each single plot was not satisfactory, because it showed peaks and valleys. We went back to study the test suite and observed that grouping them by number of vertices was not the best approach. In fact, the North DAGs come from very heterogeneous sources, mainly representing different phases of various software engineering projects; as a result, directed graphs with more or less the same number of vertices may be either very dense or very sparse.

Since most of the analyzed quality measures strongly depend on the number of edges of the DAG (e.g. area, number of bends, and number of crossings), we decided that a better approach was to group the DAGs by number of edges. After some tests, we clustered the DAGs into nine groups, each with at least 40 DAGs, so that the number of edges in the DAGs belonging to group i, 1 ≤ i ≤ 9, is in the range 10i … 10i+9 (see Fig. 3). The resulting test suite consists of 1158 DAGs, each with edges in the range 10 … 99.

From “Layer-Free Upward Crossing Minimization”:

North DAGs. The North DAGs have been introduced in an experimental comparison of algorithms for drawing DAGs [Di Battista et al. 2000]. The benchmark set contains 1,158 DAGs collected by Stephen North, which were slightly modified by Di Battista et al. The graphs are grouped into nine sets, where set i contains graphs with 10i to 10i+9 arcs for i = 1, …, 9.
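The two repairs described in the quotes above (connecting weakly disconnected DAGs with a minimum set of added arcs, then inverting a small subset of edges to remove cycles) can be sketched as follows. This is our own reconstruction under stated assumptions, since Di Battista et al.'s exact heuristics are unspecified; inverting all DFS back edges is a standard cycle-removal heuristic, and we assume arcs are (u, v) tuples with no self-loops.

```python
from collections import defaultdict

def weakly_connect(nodes, arcs):
    """Add one arc per extra weakly connected component (a minimum set)."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for u, v in arcs:          # direction ignored: weak connectivity
        parent[find(u)] = find(v)
    reps = sorted({find(v) for v in nodes})
    return arcs | {(reps[i], reps[i + 1]) for i in range(len(reps) - 1)}

def make_acyclic(nodes, arcs):
    """Invert every DFS back edge; the resulting digraph is acyclic."""
    out = defaultdict(list)
    for u, v in arcs:
        out[u].append(v)
    color = dict.fromkeys(nodes, 0)  # 0 = unvisited, 1 = on stack, 2 = done
    back = set()
    def dfs(u):
        color[u] = 1
        for v in sorted(out[u]):
            if color[v] == 1:
                back.add((u, v))  # back edge: closes a cycle
            elif color[v] == 0:
                dfs(v)
        color[u] = 2
    for v in sorted(nodes):
        if color[v] == 0:
            dfs(v)
    return {(v, u) if (u, v) in back else (u, v) for u, v in arcs}

connected = weakly_connect({0, 1, 2, 3}, {(0, 1), (2, 3)})
acyclic = make_acyclic({0, 1, 2}, {(0, 1), (1, 2), (2, 0)})
```

Reversing back edges works because, ordering vertices by decreasing DFS finish time, every remaining arc points forward.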

Example Figures

From “Drawing Directed Acyclic Graphs: An Experimental Study”:

From “A Natural Quadratic Approach to the Generalized Graph Layering Problem”:

From “A Flow Formulation for Horizontal Coordinate Assignment with Prescribed Width”:

North DAGs

This is the same collection as AT&T, but it is referred to as North DAGs in the papers we analyzed. We kept the two collections separate based on how they are referred to in the papers we collected.

Originally found at: See AT&T

Download formatted data: GEXF GML GRAPHML JSON

Size: See AT&T

Origin paper:
Drawing Directed Acyclic Graphs: An Experimental Study

Usage examples:
Layer-Free Upward Crossing Minimization
directed edges
layered graphs
n-layers
A Note on the Practicality of Maximal Planar Subgraph Algorithms
planar
A Generalization of the Directed Graph Layering Problem
directed edges
layered graphs
n-layers
A Greedy Heuristic for Crossing-Angle Maximization
generic
Star-Struck by Fixed Embeddings: Modern Crossing Number Heuristics
generic
An ILP-based Proof System for the Crossing Number Problem
generic
Drawing Directed Acyclic Graphs: An Experimental Study
directed edges
layered graphs
n-layers
Placing Arrows in Directed Graph Drawings
directed edges
Aesthetic Discrimination of Graph Layouts
generic
Placing Arrows in Directed Graph Layouts: Algorithms and Experiments
directed edges


Descriptions from Literature

From “Layer-Free Upward Crossing Minimization”:

North DAGs. The North DAGs have been introduced in an experimental comparison of algorithms for drawing DAGs [Di Battista et al. 2000]. The benchmark set contains 1,158 DAGs collected by Stephen North, which were slightly modified by Di Battista et al. The graphs are grouped into nine sets, where set \(i\) contains graphs with \(10i\) to \(10i+9\) arcs for \(i = 1, \ldots, 9\).

randDAG

The randDAG collection is the last of the collections that can be easily accessed from graphdrawing.org. The site highlights the DAGmar graph generator and provides a benchmark set of randomly generated directed acyclic graphs; while not stated explicitly, it is implied that this benchmark comes from the DAGmar generator. The collection is uniformly sampled from the set of level graphs with given parameters, such as the number of nodes and the number of edges. These graphs have no particular features and are thus classified as generic.


Download formatted data: GEXF GML GRAPHML JSON

Size: 10-100 nodes, 15-167 edges

Origin paper:
DAGmar: Library for DAGs

Usage examples:
Aesthetic Discrimination of Graph Layouts
generic


Statistics

[Histograms of node count, edge count, mean degree, and maximum degree (binned) for the graphs in this dataset.]

Descriptions from Literature

From “Aesthetic Discrimination of Graph Layouts”:

We have assembled such a dataset using two types of sources. First, we used the collections of the well-known graph archives ROME, NORTH and RANDDAG which are published on graphdrawing.org as well as the NIST’s “Matrix Market” [2].

Example Figures

From “Aesthetic Discrimination of Graph Layouts (Appendix)”:

Fig. 5. (cropped)… All graphs are visualized using the \(FM^3\) algorithm.

The following collection is particularly varied in features:

Graphviz examples

While Graphviz is a graph visualization software, its example gallery has proven useful to many researchers as a source of benchmark datasets. The graphs are the examples used to generate the images in the Graphviz gallery (https://www.graphviz.org/gallery/), and their various origins are mostly described there. These graphs were picked as Graphviz (https://www.graphviz.org) examples because they are diverse in the types of features they contain: together with generic graphs, there are also graphs with clusters, layers, directed edges, and labeled nodes.


Download formatted data: GEXF GML GRAPHML JSON

Size: 1-1464 nodes, 0-5806 edges

Origin paper:
An open graph visualization system and its applications to software engineering

Usage examples:
Optimal k-level planarization and crossing minimization
layered graphs
n-layers
Node Overlap Removal by Growing a Tree
large
FORBID: Fast Overlap Removal by Stochastic GradIent Descent for Graph Drawing
generic


Statistics

[Histograms of node count, edge count, mean degree, and maximum degree (binned) for the graphs in this dataset.]

Descriptions from Literature

From “Node Overlap Removal by Growing a Tree”:

Our data includes the same set of graphs that was used by the authors of PRISM to compare it with other algorithms [6]. The dataset is available in the Graphviz open source package.

From “Optimal k-level planarization and crossing minimization”:

The first set of graphs are all the hierarchical network diagrams appearing in the GraphViz gallery [3]… Table 1 shows the results of minimizing edge crossings and maximizing planar subgraphs with MIP and SAT solvers, as well as the crossings resulting in the Graphviz heuristic layout for graphs from the GraphViz gallery.

Example Figures

From “Node Overlap Removal by Growing a Tree”:

From “FORBID: Fast Overlap Removal by Stochastic GradIent Descent for Graph Drawing”:

From “Optimal k-level planarization and crossing minimization”:

The following two collections are particularly relevant for the display of temporal event sequences:

Storylines

The storylines dataset is particularly useful for temporal event sequence visualization because of its dynamic aspect and its clusters (which also evolve through time). It is a collection of graphs that represent movie plots: the nodes are the characters in a movie, and the edges represent the interactions between the characters. The dataset was initially collected by Yuzuru Tanahashi and stored on his homepage at UC Davis, which was later lost. Through the help of personal connections at UC Davis, we were luckily able to recover the dataset.

Militarized Interstate Disputes (MID)

Collected as part of the correlatesofwar.org project, the dataset contains two disconnected temporal event sequences.


Complete graphs

Complete graphs refers to a generic collection of fully connected graphs, i.e., graphs in which each node is connected to every other node, up to whatever number of nodes is needed for the purpose of the experiment. The linked dataset includes both the complete graphs \(K_n\) for \(5 ≤ n ≤ 50\) and the complete bipartite graphs \(K_{n_1,n_2}\) for \(5 ≤ n_1, n_2 ≤ 40\). The crossing number is conjectured for most of these and, while not proven, we found these graphs used in papers to validate minimum crossing numbers. We also note that the papers in our literature review did not provide example figures.
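The conjectured crossing numbers mentioned here have closed-form expressions: Guy's conjecture for \(K_n\) and Zarankiewicz's conjecture for \(K_{m,n}\). A sketch of both formulas follows; the values are conjectural in general (verified only for small parameters), so they validate heuristics rather than guarantee optimality.

```python
# Conjectured crossing numbers for complete and complete bipartite
# graphs. Guy's conjecture: cr(K_n) = (1/4) * floor(n/2) * floor((n-1)/2)
# * floor((n-2)/2) * floor((n-3)/2). Zarankiewicz's conjecture:
# cr(K_{m,n}) = floor(m/2) * floor((m-1)/2) * floor(n/2) * floor((n-1)/2).

def cr_complete(n):
    """Guy's conjectured crossing number of K_n."""
    return (n // 2) * ((n - 1) // 2) * ((n - 2) // 2) * ((n - 3) // 2) // 4

def cr_complete_bipartite(m, n):
    """Zarankiewicz's conjectured crossing number of K_{m,n}."""
    return (m // 2) * ((m - 1) // 2) * (n // 2) * ((n - 1) // 2)

# Known small cases: cr(K_5) = 1, cr(K_6) = 3, cr(K_{3,3}) = 1.
```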

KnownCR

KnownCR is a collection of graphs with known crossing numbers, used to test the performance of algorithms that approximate the crossing number of a graph: the results of an approximation algorithm can be compared directly against the known crossing numbers. A good resource for this is also the survey provided by (Clancy, Haythorpe, and Newcombe 2019a). The method by which these graphs can be created is fully described in https://eldorado.tu-dortmund.de/handle/2003/27430?mode=full.

The dataset comprises instances of graphs uploaded to Crossing Number WebCompute, attributed to M. Chimani and T. Wiedera (http://crossings.uos.de/cr-proof-system-paper). Their website also features proofs of the crossing numbers of many of the Rome-Lib graphs. The newest version of the database contains only non-planar graphs, but older versions do include some planar graphs.

A collection of graphs from various sources, including topological meshes, meshes related to physical problems (fluid dynamics, structural mechanics, combinatorial optimization), and interprocess communication graphs for a parallel computing implementation of a factorization solver. The collection (https://gitlab.inria.fr/scotch/scotch) is produced by the Scotch project (https://www.labri.fr/perso/pelegrin/scotch/), whose goal is to study static mapping by means of graph theory, using "divide and conquer" graph bipartitioning heuristics. The original link to the data, http://www.labri.u-bordeaux.fr/Equipe/PARADIS/Member/pelegrin/graph, is broken.

The datasets contain various attributes for a few locations and their geographical adjacency, namely neighboring states, countries, and municipalities. World Bank country information was transformed into a weight-vector dataset. Some of the incomplete data was filled in from disparate sources mentioned in the paper's supplemental materials: https://doi.org/10.48550/arXiv.1908.07291.

SteinLib is a collection of Steiner tree problems in graphs and variants. Its website (https://steinlib.zib.de/steinlib.php) has additional information about each problem in the dataset, including whether or not it has been solved. We omit two sets, https://steinlib.zib.de/showset.php?Relay and https://steinlib.zib.de/showset.php?EFST, from our provided data due to size constraints.

According to kegg.jp, “the KEGG PATHWAY database is a collection of manually drawn graphical diagrams, called KEGG pathway maps, for metabolic pathways, signaling pathways, pathways involved in various cellular processes and organismal systems, and perturbed pathways associated with human diseases.”

Collected by the authors of https://almob.biomedcentral.com/articles/10.1186/s13015-014-0031-3. Each element of the dataset is a pair of binary co-phylogenetic trees. "Caryophyllaceae & Microbotryum" and "Stinkbugs & Bacteria" are missing, but our dataset includes all other files from https://doi.org/10.1007/978-3-319-73915-1_27. An additional dataset from https://almob.biomedcentral.com/articles/10.1186/s13015-014-0031-3, called "Wolbachia", is included.

Established Network Repositories

A popular choice is to use datasets from Established Network Repositories. These are large collections, often organized in dedicated websites that also offer a few statistics about the contained graphs. Because their hosts are already dedicated to maintaining and reporting information on these collections, we do not store the data here (which would be redundant) or report statistics on them. Rather, our analysis focuses on highlighting their properties, origins, and the ways in which they have been used.

Compiled by the Mathematical and Computational Sciences Division of the Information Technology Laboratory of the National Institute of Standards and Technology, the Matrix Market is a repository of test data for use in comparative studies of algorithms for numerical linear algebra, featuring nearly 500 sparse matrices from a variety of applications, as well as matrix generation tools and services. It has been used for experiments with generic graphs, large graphs, and graphs with weighted edges. Each matrix entry is accompanied by metadata that includes the matrix name and identifier, dimensions (number of rows and columns), number of non-zero elements, symmetry type (symmetric, skew-symmetric, or general) and data type (real, complex, integer, or pattern). The website also provides visualizations of the matrices, helping users understand their structure and distribution of non-zero elements. Downloads are also provided in a variety of formats, including their own Matrix Market Exchange (MME) format, Harwell-Boeing, and MATLAB.
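Matrix Market downloads can be turned into graphs with standard tooling. The following is a minimal sketch, assuming scipy and networkx, that round-trips a small adjacency matrix through the Matrix Market Exchange format; a real use would pass a downloaded `.mtx` file name instead of the in-memory buffer:

```python
# Sketch: reading a Matrix Market Exchange (.mtx) file as a graph.
# Assumes scipy and networkx; the tiny matrix below stands in for a
# repository download.
import io

import networkx as nx
import scipy.io
import scipy.sparse as sp

# A 3-node cycle as a sparse adjacency matrix.
adj = sp.coo_matrix(([1, 1, 1], ([0, 1, 2], [1, 2, 0])), shape=(3, 3))

buf = io.BytesIO()
scipy.io.mmwrite(buf, adj)  # write in Matrix Market Exchange format
buf.seek(0)

mat = scipy.io.mmread(buf)            # read it back as a sparse matrix
G = nx.from_scipy_sparse_array(mat)   # interpret it as an undirected graph
print(G.number_of_nodes(), G.number_of_edges())  # 3 3
```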





Origin paper:
Matrix Market: a web resource for test matrix collections

Usage examples:
  Aesthetic Discrimination of Graph Layouts
generic
  Drawing graphs by eigenvectors: theory and practice
generic
  Eigensolver methods for progressive multidimensional scaling of large data
generic
large
  Graph Drawing by Stress Majorization
generic
weighted edges
  Sublinear Time Force Computation for Big Complex Network Visualization
large

Descriptions from Literature

The first example is the 1138Bus graph (|V|=1138, |E|=1458) from the Matrix Market repository [1]. This graph models a network of high-voltage power distribution lines. Figure 4 shows two layouts of this graph.

Example Figures

From “Drawing graphs by eigenvectors: theory and practice” (top two layouts are a Matrix Market graph):

The Network Repository was proposed in 2015 by Rossi and Ahmed of Purdue University as a visually interactive real-world graph tool and database. Graphs in this repository are tagged with their real-world source and come with in-depth analysis and even a preview visualization of each individual graph. The types of graphs include social, infrastructure, and biological networks, among many others. The repository also gives sources for individual graphs, and it contains many graphs from other benchmark sets described here.

Originally found at: https://networkrepository.com/

Origin paper:
The Network Data Repository with Interactive Graph Analytics and Visualization

Usage examples:
Drawing Big Graphs Using Spectral Sparsification
generic
large
A Distributed Multilevel Force-Directed Algorithm
generic
large


Descriptions from Literature

From “A Distributed Multilevel Force-Directed Algorithm”:

To evaluate the running time of MULTI-GILA, we used the REALGRAPHS and BIGGRAPHS benchmarks, which contain larger graphs extracted from both real-world applications and synthetic generators; see Table 4. The REALGRAPHS set includes the 5 largest real-world graphs (mainly scale-free graphs) used in the experimental study of GILA [5]. These graphs are taken from the Stanford Large Networks Dataset Collection [3] and from the Network Data Repository [46], and their number of edges is between 121523 and 1541514.

From “Drawing Big Graphs Using Spectral Sparsification”:

We used three data sets. The first set of graphs is taken from “defacto-benchmark” graphs, including the Hachul library, Walshaw’s Graph Partitioning Archive, the sparse matrices collection [6] and the network repository [24]. These include two types of graphs that have been extensively studied in graph drawing research: grid-like graphs and scale-free graphs.

The Pajek Program for Large Network Analysis is a tool developed and hosted by Vladimir Batagelj, Andrej Mrvar, and colleagues. As part of this program, they later compiled relevant graphs and links to other datasets, which we today call the Pajek collection. As a curiosity, pajek means spider in Slovenian. Many of the Pajek graphs have been included in the SuiteSparse Matrix Collection.


Origin paper:
Pajek datasets

Usage examples:
Energy Models for Graph Clustering
clusters (generated)
generic
A Quality Metric for Visualization of Clusters in Graphs
clusters (generated)


Descriptions from Literature

From “Energy Models for Graph Clustering”:

Airline Routing (Figure 5): The nodes of this graph represent US airports, and the (unweighted) edges represent direct flights. The probability that two airports are connected by a direct flight is strongly related to their geographical distance: Most direct flights are relatively short, and only few large hub airports are connected by direct long-range flights. The distances in the edge-repulsion LinLog layout resemble the relative geographical distances of the airports remarkably closely, given that the graph does not contain any explicit information about geographical distances.

Dictionary (Figure 7): The nodes represent terms in the Online Dictionary of Library and Information Science (ODLIS), and the edges represent hyperlinks. A hyperlink between two terms exists if one term is used to describe the meaning of the other term, and thus connects semantically related terms. The edge-repulsion LinLog layout indeed groups semantically related terms, which is better reflected in the VRML file on the supplementary web page than in Figure 7(c). Such a grouping is useful, for example, for discovering the global topic areas (like publishing, printing, information technology, etc.), for identifying entry points for the exploration of topics, or for finding semantically related terms even if they are not explicitly linked.

From “A Quality Metric for Visualization of Clusters in Graphs”:

We re-used some datasets from the validation experiments and created some new ones, listed in Table 2. We also selected real world graph datasets with existing vertex categorization, which are listed under the double line in Table 2. The datasets were taken from Pajek [2] and Stanford Network Analysis Project’s (SNAP) repository [23, 40].

Example Figures

From “Energy Models for Graph Clustering”:

The SNAP repository is a well-known, widely used collection of datasets assembled as part of the Stanford Network Analysis Platform, which started in 2004 and launched its website in 2009. A number of graphs that became relevant individually are included in SNAP, such as the Enron, Amazon, and Protein Interactions datasets and various social network graphs. SNAP contains generic graphs, directed and undirected graphs, dynamic graphs, and more. Each graph is presented with its name, a general description, and a few statistics, such as the numbers of nodes and edges and source and reference information.


Size: 1008 unique graphs (excluding graph classification tasks consisting of 352,536 graphs)

Origin paper:
SNAP Datasets: Stanford Large Network Dataset Collection

Usage examples:
A Distributed Multilevel Force-Directed Algorithm
generic
large
A Quality Metric for Visualization of Clusters in Graphs
clusters (generated)
ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software
A Random Sampling O(n) Force-calculation Algorithm for Graph Layouts
Force-Directed Graph Layouts Revisited: A New Force Based on the T-Distribution
generic
Shape-Faithful Graph Drawings
almost proximity drawable graphs
mesh graphs
scale-free graphs
strong proximity drawable graphs
weak proximity drawable graphs


Descriptions from Literature

From “ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software”:

We benchmarked our algorithm with a dataset of 68 networks from 5 to 23,133 nodes. We tried to gather varied networks corresponding to the actual use of Gephi (a lot of social networks, and scale-free networks in general). Most of these networks are from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data/) and include social networks (Facebook and Twitter ego-networks), collaboration networks (from Arxiv) and autonomous systems (peering information inferred from Oregon route-views).

Example Figures

From “ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software”:

Compiled by Donald Knuth, the Stanford Graphbase consists of programs and datasets for network analysis. The datasets accompany a textbook, “The Stanford GraphBase: A Platform for Combinatorial Computing”.


Origin paper:
The Stanford GraphBase: A Platform for Combinatorial Computing

Usage examples:
Graph Layouts by t-SNE
clusters (generated)
generic
planar
spatial
Anisotropic Radial Layout for Visualizing Centrality and Structure in Graphs
clusters (generated)
weighted edges
Crossing Minimization in Storyline Visualization
dynamic
dynamic (discrete)
layered graphs
n-layers
2-Layer Straightline Crossing Minimization
bipartite
layered graphs
Simple and Efficient Bilayer Cross Counting
bipartite
layered graphs
A Random Sampling O(n) Force-calculation Algorithm for Graph Layouts


Descriptions from Literature

From “Crossing Minimization in Storyline Visualizations”:

Since the instances “Anna Karenina” and “Les Miserables” are very big, we have split them into chapters and sequences of chapters. The resulting test-bed is made of eight chapters, seven pairs of chapters, six triples of chapters and five quadruples of chapters from “Anna Karenina”, and five chapters, four pairs of chapters and three triples of chapters from “Les Misérables”, plus the entire “Adventures of Huckleberry Finn”, “Inception-sf”, “Inception”, “Star Wars”, “The Matrix-sf”, and “The Matrix”.

From “Anisotropic Radial Layout for Visualizing Centrality and Structure in Graphs”:

The third dataset is a graph of character associations in the famous French novel Les Miserables (Fig. 5) [18]. This graph consists of 77 nodes, each representing a character in the novel, and 254 weighted edges where the weights represent the number of chapters that feature both characters associated with an edge. We see that the main protagonist Valjean (marked in red) is placed prominently in all three visualizations (Fig. 5). However, other key characters in the plot such as Inspector Javert (blue) and Cosett (orange), who do not appear often with characters other than the protagonist (and thus have low betweenness centrality) are treated differently. While the radial layout relegates them to the periphery (far from Valjean) (Fig. 5b), MDS (Fig. 5a) paints a conflicting picture with regard to their centrality, e.g., Cosett’s node almost overlaps with Valjean despite its low centrality. In contrast, the proposed ARL (Fig. 5c) is able to coherently convey the low centrality of the Inspector Javert and Cosett, as well as, their closeness to Valjean. The above issue of distance distortion appears to be a frequent occurrence in the radial layout due to many characters who have a low centrality value causing them to end up being packed in the outer periphery. A case of contrast is that of the character Bishop Myriel (green) who despite being associated with several characters, is only seen with Valjean once.

Example Figures

From “Crossing Minimization in Storyline Visualizations”:

From “Graph Layouts by t-SNE”:

lesmis is the GraphBase Les Miserables graph

From “Anisotropic Radial Layout for Visualizing Centrality and Structure in Graphs”:

The SuiteSparse Matrix Collection, formerly known as the University of Florida Sparse Matrix Collection, is a comprehensive repository of 2893 sparse matrices. All graphs in SuiteSparse belong to groups, which provide more information about the graphs and the sub-collections they belong to. In our Descriptions from Literature section we also highlight a few tables with the specific graphs used in a couple of papers. In “The University of Florida Sparse Matrix Collection,” Davis and Hu describe the origin of this repository: they cite the Harwell-Boeing collection as the starting point, back in 1991, for what was then called the University of Florida (UF) Sparse Matrix Collection. Other groups, or collections, have since been added to SuiteSparse through the years, mainly focusing on real-world matrices and other relevant problems related to them.

Originally found at: https://sparse.tamu.edu/

Size: 2893

Origin paper:
The University of Florida Sparse Matrix Collection

Usage examples:
A Maxent-Stress Model for Graph Layout
generic
A Sparse Stress Model
large
weighted edges
Drawing Big Graphs Using Spectral Sparsification
generic
large
Multi-level Graph Drawing Using Infomap Clustering
multilevel
Stochastic Gradient Descent Works Really Well for Stress Minimization
generic
Graph Layouts by t-SNE
clusters (generated)
generic
planar
spatial
Graph Drawing by Stochastic Gradient Descent
generic
DRGraph: An Efficient Graph Layout Algorithm for Large-scale Graphs by Dimensionality Reduction
generic
large
Force-Directed Graph Layouts Revisited: A New Force Based on the T-Distribution
generic
Spherical Graph Drawing by Multi-dimensional Scaling
generic
Shape-Faithful Graph Drawings
almost proximity drawable graphs
mesh graphs
scale-free graphs
strong proximity drawable graphs
weak proximity drawable graphs


Descriptions from Literature

From “A Sparse Stress Model”:

We conducted our experiments on a series of different graphs, see Tab. 1, most of them taken from the sparse matrix collection [9]. We selected these graphs as they differ in their structure and size, and are large enough to compare the results of different techniques. Two of the graphs, LeHavre and commanche, have predefined edge lengths that were derived from the node coordinates. We did not modify the graphs in any way, except for those that were disconnected, in which case we only kept the largest component.
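The preprocessing step mentioned in the quote above, keeping only the largest component of a disconnected graph, is common across these studies. A minimal sketch, assuming networkx (the example graph is made up, and this is not the authors' code):

```python
# Sketch: keeping only the largest connected component of a graph,
# the preprocessing step described in "A Sparse Stress Model".
# Assumes networkx; the example graph is made up.
import networkx as nx

def largest_component(G):
    # nx.connected_components yields node sets; take the biggest one.
    nodes = max(nx.connected_components(G), key=len)
    return G.subgraph(nodes).copy()

G = nx.Graph([(0, 1), (1, 2), (3, 4)])  # two components
H = largest_component(G)
print(sorted(H.nodes()))  # [0, 1, 2]
```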

From “A Maxent-Stress Model for Graph Layout”:

With the exception of graph gd, which is an author collaboration graph of the International Symposium on Graph Drawing between 1994 and 2007, the graphs used are from the University of Florida Sparse Matrix Collection [9]. Our selection covers a range of graph sizes, and includes mesh-like and other nonmesh graphs, and graphs from Brandes and Pich’s experimental study of distance scaling [5].

Table 2. Test Graphs

From “DRGraph: An Efficient Graph Layout Algorithm for Large-scale Graphs by Dimensionality Reduction”:

We perform experiments on a broad range of datasets selected from the University of Florida Sparse Matrix Collection [10] and tsNET [34] (Table 1).

Table 1. Test Datasets

Example Figures

From “The University of Florida Sparse Matrix Collection”:

Fig. 13. A sample of matrices from the Collection, for the purpose of illustrating the complexity and diversity of matrices arising in real applications

From “DRGraph: An Efficient Graph Layout Algorithm for Large-scale Graphs by Dimensionality Reduction”:

Fig. 8. Visualizations of selected graph datasets using FR, KK, S.M., FM3, SFDP, PMDS, tsNET and DRGraph.

Single Graphs

A number of papers used individual Single Graphs for their experiments instead of a collection. These graphs have become popular because of their properties as individual graphs: see, for example, the Enron dataset below, a huge graph with a large variance in its node degree distribution. Many of these graphs are also available in other repositories; their locations are noted wherever known.

Data made public by the Federal Energy Regulatory Commission during its investigation of Enron. The data has had a few changes over time, documented in the link provided as its source. It is currently hosted by William W. Cohen of CMU on a website, and it is also hosted on SNAP, which asks users to cite the paper linked under Origin paper. The graph has very high variance in node degree. The data can be considered threaded and directed, although SNAP provides a version of the network that is explicitly undirected. Due to its very large size, this dataset is useful for evaluating algorithms that work on very large graphs.

Originally found at: https://www.cs.cmu.edu/~enron/

Size: The data consists of 150 Enron executives, who sent 500,000 messages among themselves.

Origin paper:
Introducing the Enron Corpus

Usage examples:
An Efficient Framework for Generating Storyline Visualizations from Streaming Data
dynamic
dynamic (discrete)
layered graphs
n-layers
Reordering massive sequence views: Enabling temporal and structural analysis of dynamic networks
dynamic
dynamic (discrete)
layered graphs
n-layers


Descriptions from Literature

From “Reordering massive sequence views: Enabling temporal and structural analysis of dynamic networks”:

We cleaned the data set by removing duplicates, spam and only to contain internal communication between Enron employees annotated with employee function leaving us with 151 nodes (employees) and 21374 edges (emails). Visualization using a node-link diagram enables the identification of stronger connections (see Figure 9(a)). However, temporal patterns and the evolution of the network cannot be analysed. From the standard MSV (Figure 9(b)) it becomes somewhat clear that transaction density increases over time and we can distinguish between different phases. We cannot, however, see features and identify communities due to visual clutter

Example Figures

From “An Efficient Framework for Generating Storyline Visualizations from Streaming Data”:

From “Reordering massive sequence views: Enabling temporal and structural analysis of dynamic networks”:

The nodes of this dataset are webpages returned by an online search for the word “California”; the edges are links between the webpages. We found this dataset on the webpage of a 2002 computer science course at Cornell, taught by Jon Kleinberg. Its original name was Pages that match the word “California”. The course page mentions that the data was queried from a search engine and that many of the original links are broken. It is not clear whether Kleinberg collected the data himself or compiled it for the course.


Download formatted data: GEXF GML GRAPHML JSON

Size: 4772 nodes, 8965 edges

Origin paper:
The Structure of Information Networks
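The formatted downloads we provide (GEXF, GML, GraphML, JSON) can be loaded with standard libraries. A minimal sketch, assuming networkx, shown here for GraphML; `nx.read_gexf`, `nx.read_gml`, and `nx.node_link_graph` cover the other formats. The in-memory buffer stands in for a downloaded file:

```python
# Sketch: loading a formatted download. Assumes networkx; a real use
# would pass the downloaded file name instead of the buffer below.
import io

import networkx as nx

buf = io.BytesIO()
nx.write_graphml(nx.path_graph(3), buf)  # stand-in for a downloaded file
buf.seek(0)

G = nx.read_graphml(buf)
print(G.number_of_nodes(), G.number_of_edges())  # 3 2
```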

Usage examples:
A Treemap Based Method for Rapid Layout of Large Graphs
clusters (generated)
generic
large
Rapid Graph Layout Using Space Filling Curves
generic
large


Statistics

[Histogram: node degree distribution (count of records per degree)]
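Degree histograms like those in the Statistics sections can be recomputed from any of the downloads. A minimal sketch, assuming networkx; the generated scale-free graph is a stand-in for an actual dataset download:

```python
# Sketch: computing a node-degree histogram like those shown in the
# Statistics sections. Assumes networkx; the generated graph is a
# stand-in for a real dataset download.
from collections import Counter

import networkx as nx

G = nx.barabasi_albert_graph(100, 2, seed=7)  # scale-free stand-in
histogram = Counter(d for _, d in G.degree())
# histogram maps degree -> number of nodes with that degree
print(sum(histogram.values()))  # 100
```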

Descriptions from Literature

From “A Treemap Based Method for Rapid Layout of Large Graphs”:

This particular graph is a non-weighted graph of links between search results for the word “California” (also in Figures 6, 8, and 9, |V|=6107,|E|=15160).

From “Rapid Graph Layout Using Space Filling Curves”:

The “california” dataset (shown in Figures 5 and 8) consists of the links between the webpages found from a search for the word ‘California’ [4].

Example Figures

From “A Treemap Based Method for Rapid Layout of Large Graphs”:

Fig. 1. A graph laid out using our treemap based approach. This graph portrays the links between websites that came from a search on the word “California” [7]. Nodes are clustered into a hierarchy, and laid out by applying a treemap to this hierarchy. Levels of the hierarchy below a threshold are clustered together into larger nodes. It can very easily be seen that there are three primary groups of websites that link to each other, and a plethora of others that are not as tightly linked.

From “Rapid Graph Layout Using Space Filling Curves”:

Fig. 5 Separating clusters. By adjusting the spacing between nodes according to the clustering information, clusters can be separated.

Data collected by the World Resources Institute (WRI). The linked data is from a different year than the data used in “Optimal Sankey Diagrams via Integer Programming,” which could not be found. The graph contains weighted edges, weighted nodes, categorical nodes, and layers.


Download formatted data: GEXF GML GRAPHML JSON

Size: 40 nodes, 85 edges

Origin paper:
Navigating the Numbers: Greenhouse Gas Data and International Climate Policy

Usage examples:
Optimal Sankey Diagrams via Integer Programming
clusters (pre-existing)
layered graphs
n-layers


Statistics

[Histogram: binned node degree distribution for wri_data.json]

Descriptions from Literature

From “Optimal Sankey Diagrams via Integer Programming”:

We have tested our method on the “World Greenhouse Gas Emissions” data from the World Resources Institute [8]. After transforming the “long” edges of the graph into “short” edges and adding the corresponding dummy nodes, as described in Section 3, this example has 4 layers, 55 nodes and 100 edges. The node ordering in the layout of the Sankey diagram shown in Figure 1 has been computed using Sugiyama’s heuristic method [13].

Example Figures

From “Optimal Sankey Diagrams via Integer Programming”:


20,058 chess openings from a database of games played online. The authors of “Sequence Braiding: Visual Overviews of Temporal Event Sequences and Attributes” obtained the dataset from the data science platform Kaggle, where it was collected by user Mitchell J. from the chess website Lichess.org. The data used in the literature focuses on the 200 most common openings; these are all sequences of 6 moves and are topologically equivalent. It is a layered graph with categorical nodes and layers, and can be seen as a temporal event sequence.
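The conversion from move sequences to a disconnected layered graph can be sketched as follows. This is an illustrative reconstruction assuming networkx, not the authors' code, and the openings below are made up; note that each sequence becomes its own path, which is why the full dataset yields a disconnected graph:

```python
# Sketch: turning event sequences into a disconnected layered graph,
# as described for the chess openings above. Illustrative only;
# assumes networkx, and the sequences are made up.
import networkx as nx

def sequences_to_layered_graph(sequences):
    G = nx.DiGraph()
    for i, seq in enumerate(sequences):
        # One node per (sequence, layer) position, so each sequence
        # is its own path and the overall graph is disconnected.
        for layer, move in enumerate(seq):
            G.add_node((i, layer), move=move, layer=layer)
            if layer > 0:
                G.add_edge((i, layer - 1), (i, layer))
    return G

openings = [["e4", "e5", "Nf3"], ["e4", "c5", "Nf3"]]
G = sequences_to_layered_graph(openings)
print(G.number_of_nodes(), G.number_of_edges())  # 6 4
```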


Download formatted data: GEXF GML GRAPHML JSON

Size: The data consists of 20058 sequences, which result in a disconnected graph of 118164 nodes, and 98106 edges.

Origin paper:
Chess Game Dataset (Lichess)

Usage examples:
Sequence Braiding: Visual Overviews of Temporal Event Sequences and Attributes
dynamic
dynamic (discrete)
layered graphs
n-layers


Descriptions from Literature

From “Sequence Braiding: Visual Overviews of Temporal Event Sequences and Attributes”:

200 Chess openings displayed with Sequence Braiding. Each line represents a sequence of moves of the white player, each group is a chess piece type. Most openings start with a pawn, and very little with the knight.

Example Figures

From “Sequence Braiding: Visual Overviews of Temporal Event Sequences and Attributes”:

Fig. 7. 200 Chess openings displayed with Sequence Braiding. Each line represents a sequence of moves of the white player, each group is a chess piece type. Most openings start with a pawn, and very little with the knight. After moving a pawn, it is common to move a knight or a pawn, it is a little less common to move a bishop, and only a little number of openings move the queen on the second move.

Part of the Tree of Life Web Project (http://tolweb.org/tree/), which aims to characterize information about biodiversity and its evolutionary genealogy. It is a tree with categorical nodes.


Download formatted data: GEXF GML GRAPHML JSON

Size: 35,960 nodes, 35,959 edges

Origin paper:
Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation
The Tree of Life Web Project

Usage examples:
Visualizing Evolving Trees
dynamic
dynamic (discrete)


Descriptions from Literature

From “Visualizing Evolving Trees”:

The Tree of Life: captures the evolutionary progression of life on Earth [37]. The underlying data is a tree structure with a natural time component. As a new species evolves, a new node in the tree is added. The edges give the parent-child relation of the nodes, where the parent is the original species, and the child is the new species. We use a subset of this graph with 500 nodes. The maximum node degree of this tree is 5, and the radius is 24

Example Figures

From “Visualizing Evolving Trees”:

Fig. 4. Layouts obtained by the seven methods for the tree of life dataset.

Bilateral trade values in 1999 (total imports + exports), from wits.worldbank.org. The graph is dense and has categorical nodes.


Download formatted data: GEXF GML GRAPHML JSON


Size: 194 nodes, 10080 edges

Origin paper:
The WTO promotes trade strongly but unevenly

Usage examples:
Energy Models for Graph Clustering
clusters (generated)
generic
Untangling the Hairballs of Multi-Centered Small-World Online Social Media Networks
generic
large


Statistics

[Histogram: binned node degree distribution for trade_data.json]

Descriptions from Literature

From “Energy Models for Graph Clustering”:

The difference between conventional energy models, node-repulsion LinLog, and edge-repulsion LinLog can be illustrated with a model of the trade between ten North American and European countries. The nodes of the graph correspond to the countries, and the edge weights specify the trade volume between each pair of countries. Because of geographical closeness and free trade agreements, countries on the same continent trade more intensively than countries on different continents. Figure 1 shows the minimum energy layouts of the trade graph for the three force and energy models. The layout of the widely used Fruchterman-Reingold model [20] does not show any clear groups at all. The layout of the node-repulsion LinLog energy model groups the countries (nodes) primarily according to their total trade volume (degree). Only the layout of the edge-repulsion LinLog energy model shows the expected grouping according to continents.

Example Figures

From “Untangling the Hairballs of Multi-Centered Small-World Online Social Media Networks”:

From “Energy Models for Graph Clustering”:

Data collected in a survey of Ohio State University students by D. W. Malone. Students were presented with two potential obstacles to investment in Columbus, Ohio’s business district and asked, “does obstacle A aggravate or intensify obstacle B?”. The graph contains directed edges and is hierarchical.

Download formatted data: GEXF GML GRAPHML JSON

Size: 25 nodes, 30 edges

Origin paper:
An introduction to the application of interpretive structural modeling

Usage examples:
Methods for Visual Understanding of Hierarchical System Structures
hierarchical
layered graphs
n-layers


Statistics

[Histograms: component size and node degree distributions]

Description from Literature

From Methods for Visual Understanding of Hierarchical System Structures:

Presented in Fig. 5 are drawings of the nine-level hierarchy which appeared in Malone [11] and represent the interdependence of obstacles to investment in the Columbus, Ohio, central business district.

Example Figures

From Methods for Visual Understanding of Hierarchical System Structures:

Fig. 5. (cropped): Maps of the nine-level hierarchy [11]. (a) Initial. (b) BC + QP methods (c = 1.0). (c) BC + QP methods (cL= 0.5). (d) BC + QP methods (c = 0.05). (e) BC + PR methods.

The Protein Interaction Publications data was collected from the Biological Pathway Commons database. Various types of protein interaction graphs are recorded in other network collections; we highlight two of them from the surveyed papers. First, the temporal Protein Interaction Publications graph shows the history of how protein interactions were described in the literature. Second, the Protein Homology graph was provided by the Large Graph Layout (LGL) project; its links are now broken and the dataset is lost. Nonetheless, SNAP and KONECT both host various datasets of protein interactions for humans and other species.


Download formatted data: GEXF GML GRAPHML JSON

Size: 2961 nodes, 5267 edges

Origin paper:
LGL: Creating a Map of Protein Function with an Algorithm for Visualizing Very Large Biological Networks
Usage examples:
Rapid Graph Layout Using Space Filling Curves
generic
large
TimeArcs: Visualizing Fluctuations in Dynamic Networks
dynamic
dynamic (discrete)
layered graphs
n-layers


Protein Interaction Publications

Statistics

[Histogram: node degree distribution (count of records per degree)]

Description from the Literature

From TimeArcs: Visualizing Fluctuations in Dynamic Networks:

The data contains the publication information (such as publication year, author, and textual evidence) of interactions between pairs of proteins, as well as their specific interaction type.

When there are multiple arcs connecting two proteins, it falls into one of the two circumstances. If they have the same color, these arcs indicate that there are supporting evidences in different publications which confirm the interaction between two elements. On the other hand, if they have the different colors, the more recent appearance provides either more detailed knowledge about the interaction or shows a conflict between different articles regarding the way in which these proteins interact.

Example Figures

From TimeArcs: Visualizing Fluctuations in Dynamic Networks:

Fig. 10: TimeArcs visualization for interactions around PCAF protein. (1), (2), and (3) in the figure are supporting evidences in literature of “PCAF binds MAML”.

3.3 Protein Homology (Lost)

Description from the Literature

From [Rapid Graph Layout Using Space Filling Curves:](https://ieeexplore.ieee.org/document/4658143)

The “pgraph” dataset (shown in Figures 2 and 6) is a protein homology graph, which is relatively dense [7].

The graph consists of 28,854 vertices and 1,180,816 edges, as reported in Table 1 of the paper above.

Example Figures

From Rapid Graph Layout Using Space Filling Curves:

Fig. 2. A protein homology graph laid out with our space filling curve based approach. Color corresponds to depth in the clustering hierarchy. |V| = 28,854, |E| = 1,180,816

From LGL: Creating a Map of Protein Function with an Algorithm for Visualizing Very Large Biological Networks:

Fig. 3. The complete protein homology map. A layout of the entire protein homology map; a total of 11,516 connected sets containing 111,604 proteins (vertices) with 1,912,684 edges. The largest connected set is shown more clearly in the inset and is enlarged further in Figure 4.

Aggregate Collections

Many papers used graphs from specific domains with particular characteristics (e.g., geographical coordinates are often found in airline data). Instead of collecting each of these individual, contextual datasets, we aggregated them into further subcategories, which we call Aggregate collections. Information about each aggregate collection can be found in the papers that use it.

[Sparkline: Code Dependency Graphs]

[Sparkline: Social Networks]

Subsets of other collections

Finally, we collected some datasets that we noticed are subsets of existing collections. This is a phenomenon that can happen over the years through redistribution and through the merging of different sources: the Walshaw dataset, for instance, was and still is distributed and cited as a standalone dataset, but its graphs can be found in many other, larger collections, such as SNAP. We classified these datasets as Subsets.

Lost and unavailable datasets

Unfortunately, some of the datasets used in the papers in our corpus are lost or no longer available. While we made the effort to recover each of them and store them on OSF, we could not find the following datasets anywhere:

Custom-made datasets

In the data we collected, we also found several instances of custom-made datasets. We consider a dataset custom-made either when it is an edit of a pre-existing dataset, where the authors found it necessary to split or modify the data in a particular way, or when it was created entirely from scratch using random generators or custom-made code. This typically happens when the authors of a paper need a dataset with particular characteristics that is not easy to find in the wild, so a new dataset is crafted.

For instance, consider the case where the authors of a paper develop an algorithm that works on hypergraphs. They want to test that the algorithm works and measure its performance on hypergraphs of various sizes, but datasets containing hypergraphs are difficult to find. For this reason, the authors either craft a dataset synthetically or take a pre-existing dataset and edit it so that it contains hyperedges.

We split custom-made datasets into three categories, with their occurrences in the corpus of papers illustrated below:

Replicable datasets indicate cases where the authors have given enough information that the experiment can be replicated exactly as it was run, or closely enough that the results obtained reflect the published ones very closely. This includes cases where the authors published the entire dataset they used, published the code they used to generate the dataset, or included an exact description of the steps they took to generate it.

Reproducible datasets are cases where the authors described the steps they took to generate and/or edit their datasets, but not in enough depth for the exact same graphs to be recreated, and did not redistribute them. Results can still be reproduced somewhat closely if the authors took care to report enough information about their graphs.

Non-replicable datasets indicate cases where the authors neither distributed their datasets nor included enough information in the paper for their results to be replicated.
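For example, one simple way to make a generated dataset replicable is to fix and report the random seeds used for generation. The following is a minimal sketch, assuming Python with the networkx library; the specific sizes and seed are hypothetical:

```python
import networkx as nx

SEED = 42  # fixing and reporting the seed makes the generation replicable

# Hypothetical example: a paper needs ten random graphs of growing size.
graphs = [
    nx.gnm_random_graph(n=100 * (i + 1), m=300 * (i + 1), seed=SEED + i)
    for i in range(10)
]

# Regenerating with the same seed yields exactly the same graph, so
# redistributing this snippet (or the graph files themselves, e.g. via
# nx.write_graphml) is enough to make the dataset replicable.
again = nx.gnm_random_graph(n=100, m=300, seed=SEED)
identical = set(again.edges()) == set(graphs[0].edges())
```

Publishing either the generator code with its seeds or the generated files as supplemental material satisfies the replicability criterion above.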

This information is closely tied to the distribution of supplemental material in papers, which is shown in the chart below:

This discussion is part of a larger discourse on research replicability that is gaining traction in the scientific community. The ACM, for instance, has a policy on artifact review and badging: authors are encouraged to submit their artifacts for review and, if the artifacts pass, the paper receives badges indicating, for example, that the artifacts are available and functional. This is a step towards making research more replicable and reproducible, and we hope that our work will contribute to this effort.

See, e.g., ACM’s definitions at https://www.acm.org/publications/policies/artifact-review-and-badging-current.

3.4 Details of the datasets

For each of the datasets collected as part of our process, we conducted a brief analysis of their contents. Where possible, the analysis includes the number of nodes per graph, the source of the dataset, which papers have used it, and which graph features they took into account. The following section defaults to illustrating the contents of Rome-Lib, but the reader can change the dataset in focus via the dropdown menu below:

3.5 Random generation

We discussed above that in a few cases, authors have generated their own datasets to test their algorithms. We call these generated datasets synthetic or custom, as opposed to datasets found in the wild. Generating a synthetic dataset can be essential for several reasons, particularly in the context of algorithm development and validation:

  • Privacy and Publication Constraints: Imagine you’ve created an innovative algorithm using a dataset that, due to privacy concerns, cannot be shared publicly. To validate and share your work without compromising data confidentiality, creating a synthetic dataset that mirrors the essential features of the original data allows you to showcase your algorithm’s effectiveness while adhering to privacy restrictions.
  • Benchmarking and Comparative Analysis: Suppose you have access to a single dataset that can be shared and wish to prove your algorithm’s robustness across various scenarios. Generating synthetic datasets with comparable characteristics provides a controlled environment for benchmark studies. This approach enables you to demonstrate that your algorithm performs consistently well across datasets with analogous features, thereby reinforcing its applicability and reliability.
  • Addressing Data Availability Issues: In certain situations, you might design an algorithm to process graphs with specific attributes only to discover the absence of publicly available datasets showcasing these features. Synthetic datasets become invaluable here, allowing you to create tailored data that incorporates the necessary characteristics. This approach not only facilitates the testing of your algorithm in a relevant context but also helps in illustrating its potential in hypothetical yet plausible scenarios.

We also use the word synthetic in cases where the dataset has been altered, sliced, or otherwise modified.

The following is a practical example of a case where you would need to edit a network:

Imagine you are building a visualization to handle the entirety of the Enron Corpus dataset, which has hundreds of thousands of nodes and edges. Because of its size, you decide to slice the large dataset into many smaller files so you can run tests. In addition to being a challenging problem because of its size, this dataset also has an interesting distribution of connection densities: some nodes are extremely well connected, while others are much less so. Indeed, the dataset comprises a collection of emails sent by Enron executives, between themselves or to other employees of the company. Because of how the dataset is constructed, where only the emails from the executives are taken into account, it has a distinctive skew in connectedness: 158 nodes are extremely well connected, while the rest are much less so. Countless mistakes can therefore be made when slicing this graph: for instance, a slice covering a less dense section of the graph will fail to provide a representative sample.
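The following sketch illustrates the pitfall, assuming Python with the networkx library. The graph here is a synthetic scale-free stand-in for the Enron data, not the real corpus, and the slicing strategies are hypothetical:

```python
import networkx as nx

# Hypothetical stand-in for a skewed graph like the Enron corpus:
# scale-free, so a few nodes are extremely well connected.
G = nx.barabasi_albert_graph(n=5000, m=3, seed=7)
hub = max(G.nodes, key=G.degree)

def mean_degree(graph):
    return sum(d for _, d in graph.degree()) / graph.number_of_nodes()

# A naive slice (the last 1000 nodes added) misses the hubs entirely,
# while a 2-hop neighborhood around a hub keeps the dense core.
naive_slice = G.subgraph(range(4000, 5000))
hub_slice = nx.ego_graph(G, hub, radius=2)

# A representative slice should have statistics (here, mean degree)
# close to those of the full graph.
stats = {name: mean_degree(g)
         for name, g in [("full", G), ("naive", naive_slice), ("hub", hub_slice)]}
```

Comparing such statistics between a slice and the full graph is a quick sanity check before using the slice as a benchmark.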

While the creation of a synthetic dataset is a perfectly viable way to produce a benchmark set, in addition to the replicability criteria discussed above, we also have to pay particular attention to a number of statistics related to the structure of graphs, both when generating new datasets and when slicing up existing datasets to reduce their size or to create more graphs from a single, larger graph.

The full list of features to take into account before claiming that a synthetic graph is comparable to another would be long, and perhaps out of the scope of this publication. These are just a few examples of what could be relevant:

  • Size (nodes, edges): The total number of nodes and edges in the graph, indicating its overall scale.
  • Diameter: The longest shortest path between any two nodes, showing the graph’s maximum extent.
  • Density: The ratio of existing edges to possible edges, reflecting how closely knit the graph is.
  • Motifs: Recurring, significant subgraphs; identify which motifs occur more or less frequently than expected.
  • Connectedness: Evaluates both the number and sizes of graph components, along with detailed analysis per component.
  • Centrality Measures (betweenness, closeness, degree): Quantify the importance of nodes based on their position and connections within the graph.
  • Special Cases:
    • Layers: Characteristics like number of layers, nodes per layer, and inter-layer connections.
    • Disjoint Groups: The count and size distribution of separate clusters, plus an analysis of each group’s properties.
    • Overlapping Sets: Number and size distribution of intersecting groups, along with detailed features.
    • Additional Node/Edge Data: Information such as timestamps, attributes, or weights that add context to nodes/edges.
    • Dynamic: Describes changes in the graph over time, including the nature and frequency of node/edge modifications.
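
Several of the statistics above can be computed directly. The following is a minimal sketch, assuming Python with the networkx library; the summary keys are our own illustrative naming:

```python
import networkx as nx

def graph_summary(G):
    """A minimal sketch computing a few of the statistics listed above."""
    components = list(nx.connected_components(G))
    largest_cc = G.subgraph(max(components, key=len))
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "density": nx.density(G),        # ratio of existing to possible edges
        "components": len(components),   # connectedness
        "diameter_largest_cc": nx.diameter(largest_cc),  # defined per component
        "max_degree_centrality": max(nx.degree_centrality(G).values()),
    }

# The Petersen graph: 10 nodes, 15 edges, 3-regular, diameter 2.
summary = graph_summary(nx.petersen_graph())
```

Reporting such a summary alongside a synthetic dataset makes it much easier for others to judge whether the graphs are comparable to their own.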

The generation of random graphs that accurately mimic specific features presents a complex challenge that has been explored abundantly in the past but has found no universal solution yet. Two examples of popular models used to generate synthetic datasets are the Erdős-Rényi (ER) and the Barabási-Albert (BA) models, each with its own distinct focus and limitations:

The ER model excels at creating graphs with uniform edge distribution, ideal for testing an algorithm’s ability to evenly space nodes and reduce edge crossings. However, it falls short in replicating the complex, non-uniform connections seen in real-world networks, limiting its applicability to scenarios requiring simplicity and randomness. Conversely, the BA model produces scale-free networks with a few highly connected nodes, reflecting the hierarchical structure of many natural and human-made systems. It challenges layout algorithms to effectively display these hubs without clutter. The limitation here is its focus on growth and preferential attachment, which might not suit all types of networks, particularly those without clear hub structures.
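The contrast between the two models is easy to see empirically. A brief sketch, assuming Python with the networkx library, with parameters chosen so both graphs have a mean degree of roughly six:

```python
import networkx as nx

n = 2000
# ER: each pair of nodes is connected independently with probability p,
# giving a roughly uniform (binomial) degree distribution.
er = nx.erdos_renyi_graph(n, p=6 / n, seed=1)
# BA: preferential attachment grows a scale-free graph with a few hubs.
ba = nx.barabasi_albert_graph(n, m=3, seed=1)

# Both graphs have a similar mean degree, but the BA graph concentrates
# far more edges on its most connected node.
max_deg_er = max(d for _, d in er.degree())
max_deg_ba = max(d for _, d in ba.degree())
```

Despite similar sizes and densities, the maximum degree of the BA graph is typically an order of magnitude larger than that of the ER graph, which is exactly the hub structure discussed above.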

As there is no universal solution for random graph generation, our recommendation is to pay as much attention as possible to research replicability criteria, such as redistributing the generated dataset as supplemental material to the paper.

4 Conclusion

We presented a comprehensive benchmark dataset collection for graph layout algorithms. Compiling, organizing, and making accessible a wide array of datasets with diverse characteristics not only facilitates rigorous and fair comparisons of algorithmic performance but also addresses critical issues of replicability and reproducibility in research. The Graph Benchmark Datasets website, along with our efforts toward long-term archival, ensures these valuable resources remain available to the community. We aim to help both current and future researchers benefit from a rich repository of data, standardizing the evaluation process for graph layout algorithms, fostering innovation and development within the field, and contributing to the advancement of graph drawing and visualization research. With this work, we hope to underscore the importance of accessible, well-documented benchmark datasets in driving scientific progress and to highlight our commitment to enhancing the integrity and reliability of computational evaluations in the Graph Drawing community.

5 Research Material Statements

The supplemental material for this publication includes:

  • The Graph Benchmark Datasets website, hosted on GitHub Pages.
  • Code for the website, metadata on the benchmark datasets, and data processing scripts are stored in the gd_benchmark_sets GitHub repository.
  • This code and metadata, along with the benchmark datasets themselves in several common graph file formats, are archived for long-term availability in the Benchmark datasets for graph drawing project on OSF.
  • The original database used for metadata collection and storage is available on Notion as Benchmark datasets.

6 Acknowledgements

This work was supported in part by the U.S. National Science Foundation (NSF) under award number CAREER IIS-1762268 and the Austrian Science Fund (FWF) [ESP 513-N].

7 Authorship

  • Sara Di Bartolomeo: Conceptualization, Data Collection & Categorization, Writing Original Draft + Review & Editing, Visualization, Validation.
  • Eduardo Puerta: Data Collection & Categorization, Writing Original Draft + Review & Editing, Validation.
  • Connor Wilson: Data Collection & Categorization, Writing Original Draft + Review & Editing, Validation.
  • Tarik Crnovrsanin: Data Collection & Categorization, Writing Original Draft + Review & Editing, Validation.
  • Cody Dunne: Supervision, Writing Original Draft + Review & Editing, Validation, Funding Acquisition.

8 License

This work is licensed under a Creative Commons Attribution 4.0 International License. All code is released under the Apache 2.0 License.

9 Conflict of Interest

The authors declare that there are no competing interests.

References

Adai, Alex T., Shailesh V. Date, Shannon Wieland, and Edward M. Marcotte. 2004. “LGL: Creating a Map of Protein Function with an Algorithm for Visualizing Very Large Biological Networks.” Journal of Molecular Biology 340 (1): 179–90. https://doi.org/10.1016/j.jmb.2004.04.047.
Arleo, Alessio, Walter Didimo, Giuseppe Liotta, and Fabrizio Montecchiani. 2019. “A Distributed Multilevel Force-Directed Algorithm.” IEEE Transactions on Parallel and Distributed Systems 30 (4): 754–65. https://doi.org/10.1109/tpds.2018.2869805.
Bachmaier, Christian, Andreas Gleißner, and Andreas Hofmeier. 2012. “DAGmar: Library for DAGs.” https://www.infosun.fim.uni-passau.de/~chris/down/MIP-1202.pdf.
Barth, Wilhelm, Michael Jünger, and Petra Mutzel. 2002. “Simple and Efficient Bilayer Cross Counting.” In Graph Drawing, edited by Michael T. Goodrich and Stephen G. Kobourov, 130–41. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-36151-0_13.
Bartolomeo, Sara di, Yixuan Zhang, Fangfang Sheng, and Cody Dunne. 2021. “Sequence Braiding: Visual Overviews of Temporal Event Sequences and Attributes.” IEEE Transactions on Visualization and Computer Graphics 27 (2): 1353–63. https://doi.org/10.1109/tvcg.2020.3030442.
Batagelj, Vladimir, and Andrej Mrvar. 2006. “Pajek Datasets.” In. http://vlado.fmf.uni-lj.si/pub/networks/data/.
Battista, Giuseppe Di, Ashim Garg, Giuseppe Liotta, Armando Parise, Roberto Tamassia, Emanuele Tassinari, Francesco Vargiu, and Luca Vismara. 2000. “Drawing Directed Acyclic Graphs: An Experimental Study.” International Journal of Computational Geometry & Applications 10 (06): 623–48. https://doi.org/10.1142/s0218195900000358.
Battista, Giuseppe Di, Ashim Garg, Giuseppe Liotta, Roberto Tamassia, Emanuele Tassinari, and Francesco Vargiu. 1997. “An Experimental Comparison of Four Graph Drawing Algorithms.” Computational Geometry 7 (5-6): 303–25. https://doi.org/10.1016/s0925-7721(96)00005-3.
Baumert, Kevin A., Timothy Herzog, and Jonathan Pershing. 2005. “Navigating the Numbers: Greenhouse Gas Data and International Climate Policy.” World Resources Institute (WRI). https://files.wri.org/d8/s3fs-public/pdf/navigating_numbers.pdf.
Bekos, Michael A., Henry Förster, Christian Geckeler, Lukas Holländer, Michael Kaufmann, Amadäus M. Spallek, and Jan Splett. 2018. “A Heuristic Approach Towards Drawings of Graphs with High Crossing Resolution.” In Lecture Notes in Computer Science, 271–85. Springer International Publishing. https://doi.org/10.1007/978-3-030-04414-5_19.
Binucci, Carla, Markus Chimani, Walter Didimo, Giuseppe Liotta, and Fabrizio Montecchiani. 2016. “Placing Arrows in Directed Graph Drawings.” In Lecture Notes in Computer Science, 44–51. Springer International Publishing. https://doi.org/10.1007/978-3-319-50106-2_4.
Binucci, Carla, Walter Didimo, Michael Kaufmann, Giuseppe Liotta, and Fabrizio Montecchiani. 2022. “Placing Arrows in Directed Graph Layouts: Algorithms and Experiments.” Computer Graphics Forum 41 (1): 364–76. https://doi.org/10.1111/cgf.14440.
Börsig, Katharina, Ulrik Brandes, and Barna Pasztor. 2020. “Stochastic Gradient Descent Works Really Well for Stress Minimization.” In Lecture Notes in Computer Science, 18–25. Springer International Publishing. https://doi.org/10.1007/978-3-030-68766-3_2.
Buchheim, Christoph, Markus Chimani, Dietmar Ebner, Carsten Gutwenger, Michael Jünger, Gunnar W. Klau, Petra Mutzel, and René Weiskircher. 2008. “A Branch-and-Cut Approach to the Crossing Number Problem.” Discrete Optimization 5 (2): 373–88. https://doi.org/10.1016/j.disopt.2007.05.006.
Chimani, Markus, and Carsten Gutwenger. 2012. “Advances in the Planarization Method: Effective Multiple Edge Insertions.” In Graph Drawing, 87–98. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-25878-7_10.
Chimani, Markus, Carsten Gutwenger, and Petra Mutzel. 2006. “Experiments on Exact Crossing Minimization Using Column Generation.” In Experimental Algorithms, edited by Carme Àlvarez and María Serna, 303–15. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/11764298_28.
Chimani, Markus, Carsten Gutwenger, Petra Mutzel, and Hoi-Ming Wong. 2010. “Layer-Free Upward Crossing Minimization.” ACM J. Exp. Algorithmics 15 (March). https://doi.org/10.1145/1671970.1671975.
Chimani, Markus, Max Ilsen, and Tilo Wiedera. 2021. “Star-Struck by Fixed Embeddings: Modern Crossing Number Heuristics.” In Lecture Notes in Computer Science, 41–56. Springer International Publishing. https://doi.org/10.1007/978-3-030-92931-2_3.
Chimani, Markus, Karsten Klein, and Tilo Wiedera. 2016. “A Note on the Practicality of Maximal Planar Subgraph Algorithms.” In Graph Drawing and Network Visualization, edited by Yifan Hu and Martin Nöllenburg, 357–64. Cham: Springer International Publishing.
Chimani, Markus, Petra Mutzel, and Immanuel Bomze. 2008. “A New Approach to Exact Crossing Minimization.” In Algorithms - ESA 2008, 284–96. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-87744-8_24.
Chimani, Markus, and Tilo Wiedera. 2016. “An ILP-Based Proof System for the Crossing Number Problem.” In 24th Annual European Symposium on Algorithms (ESA 2016), edited by Piotr Sankowski and Christos Zaroliagis, 57:29:1–13. Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. https://doi.org/10.4230/LIPIcs.ESA.2016.29.
Clancy, Kieran, Michael Haythorpe, and Alex Newcombe. 2019a. “A Survey of Graphs with Known or Bounded Crossing Numbers.” Australas. J Comb. 78: 209–96. https://api.semanticscholar.org/CorpusID:119313173.
———. 2019b. “An Effective Crossing Minimisation Heuristic Based on Star Insertion.” Journal of Graph Algorithms and Applications 23 (2): 135–66. https://doi.org/10.7155/jgaa.00487.
Dang, T. N., N. Pendar, and A. G. Forbes. 2016. “TimeArcs: Visualizing Fluctuations in Dynamic Networks.” Computer Graphics Forum 35 (3): 61–69. https://doi.org/10.1111/cgf.12882.
Davis, Timothy A., and Yifan Hu. 2011. “The University of Florida Sparse Matrix Collection.” ACM Trans. Math. Softw. 38 (1). https://doi.org/10.1145/2049662.2049663.
Demel, Almut, Dominik Dürrschnabel, Tamara Mchedlidze, Marcel Radermacher, and Lasse Wulf. 2018. “A Greedy Heuristic for Crossing-Angle Maximization.” In Lecture Notes in Computer Science, 286–99. Springer International Publishing. https://doi.org/10.1007/978-3-030-04414-5_20.
Di Bartolomeo, Sara, Tarik Crnovrsanin, David Saffo, Eduardo Puerta, Connor Wilson, and Cody Dunne. 2024. “Evaluating Graph Layout Algorithms: A Systematic Review of Methods and Best Practices.” Computer Graphics Forum n/a (n/a): e15073. https://doi.org/10.1111/cgf.15073.
Di Bartolomeo, Sara, Mirek Riedewald, Wolfgang Gatterbauer, and Cody Dunne. 2022. “STRATISFIMAL LAYOUT: A Modular Optimization Model for Laying Out Layered Node-Link Network Visualizations.” IEEE Transactions on Visualization and Computer Graphics 28 (1): 324–34. https://doi.org/10.1109/tvcg.2021.3114756.
Di Battista, Giuseppe, Ashim Garg, and Giuseppe Liotta. 1995. “An Experimental Comparison of Three Graph Drawing Algorithms (Extended Abstract).” In Proceedings of the Eleventh Annual Symposium on Computational Geometry, 306–15. SCG ’95. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/220279.220312.
Eades, Peter, Quan Nguyen, and Seok-Hee Hong. 2018. “Drawing Big Graphs Using Spectral Sparsification.” In Lecture Notes in Computer Science, 272–86. Springer International Publishing. https://doi.org/10.1007/978-3-319-73915-1_22.
Elzen, Stef van den, Danny Holten, Jorik Blaas, and Jarke J. van Wijk. 2013. “Reordering Massive Sequence Views: Enabling Temporal and Structural Analysis of Dynamic Networks.” In 2013 IEEE Pacific Visualization Symposium (PacificVis). IEEE. https://doi.org/10.1109/pacificvis.2013.6596125.
Gange, Graeme, Peter J. Stuckey, and Kim Marriott. 2011. “Optimal k-Level Planarization and Crossing Minimization.” In Graph Drawing, 238–49. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-18469-7_22.
Gansner, E. R., Yifan Hu, and S. North. 2013. “A Maxent-Stress Model for Graph Layout.” IEEE Transactions on Visualization and Computer Graphics 19 (6): 927–40. https://doi.org/10.1109/tvcg.2012.299.
Gansner, Emden R., and Yehuda Koren. 2006. “Improved Circular Layouts.” In Graph Drawing, 386–98. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-70904-6_37.
Gansner, Emden R., and Stephen C. North. 2000. “An Open Graph Visualization System and Its Applications to Software Engineering.” Softw. Pract. Exper. 30 (11): 1203–33.
Giovannangeli, Loann, Frederic Lalanne, Romain Giot, and Romain Bourqui. 2023. “FORBID: Fast Overlap Removal by Stochastic GradIent Descent for Graph Drawing.” In Graph Drawing and Network Visualization, edited by Patrizio Angelini and Reinhard von Hanxleden, 61–76. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-22203-0_6.
Gray, Kathryn, Mingwei Li, Reyan Ahmed, and Stephen Kobourov. 2023. “Visualizing Evolving Trees.” In Graph Drawing and Network Visualization, edited by Patrizio Angelini and Reinhard von Hanxleden, 319–35. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-22203-0_23.
Gronemann, Martin, Michael Jünger, Frauke Liers, and Francesco Mambelli. 2016. “Crossing Minimization in Storyline Visualization.” In Lecture Notes in Computer Science, 367–81. Springer International Publishing. https://doi.org/10.1007/978-3-319-50106-2_29.
Gutwenger, Carsten, and Petra Mutzel. 2004. “An Experimental Study of Crossing Minimization Heuristics.” In Graph Drawing, 13–24. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-24595-7_2.
Hachul, Stefan, and Michael Jünger. 2006. “An Experimental Comparison of Fast Algorithms for Drawing General Large Graphs,” 235–50. https://doi.org/10.1007/11618058_22.
Hachul, Stefan, and Michael Jünger. 2005. “Drawing Large Graphs with a Potential-Field-Based Multilevel Algorithm.” In Graph Drawing, 285–95. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-31843-9_29.
Heinsohn, Niklas, and Michael Kaufmann. 2018. “An Interactive Tool to Explore and Improve the Ply Number of Drawings.” In Lecture Notes in Computer Science, 38–51. Springer International Publishing. https://doi.org/10.1007/978-3-319-73915-1_4.
Hong, Seok-Hee, Peter Eades, Marnijati Torkel, Ziyang Wang, David Chae, Sungpack Hong, Daniel Langerenken, and Hassan Chafi. 2019. “Multi-Level Graph Drawing Using Infomap Clustering.” In Lecture Notes in Computer Science, 139–46. Springer International Publishing. https://doi.org/10.1007/978-3-030-35802-0_11.
Jabrayilov, Adalat, Sven Mallach, Petra Mutzel, Ulf Rüegg, and Reinhard von Hanxleden. 2016. “Compact Layered Drawings of General Directed Graphs.” In Lecture Notes in Computer Science, 209–21. Springer International Publishing. https://doi.org/10.1007/978-3-319-50106-2_17.
Jolly, Mitchell. 2017. “Chess Game Dataset (Lichess).” Kaggle. https://www.kaggle.com/datasets/datasnaek/chess?resource=download.
Jünger, Michael, and Petra Mutzel. 1996. “Exact and Heuristic Algorithms for 2-Layer Straightline Crossing Minimization.” In Graph Drawing, edited by Franz J. Brandenburg, 337–48. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/BFb0021817.
Jünger, Michael, Petra Mutzel, and Christiane Spisla. 2018. “A Flow Formulation for Horizontal Coordinate Assignment with Prescribed Width.” In Lecture Notes in Computer Science, 187–99. Springer International Publishing. https://doi.org/10.1007/978-3-030-04414-5_13.
Kindermann, Philipp, Wouter Meulemans, and André Schulz. 2018. “Experimental Analysis of the Accessibility of Drawings with Few Segments.” Journal of Graph Algorithms and Applications 22 (3): 501–18. https://doi.org/10.7155/jgaa.00474.
Klammler, Moritz, Tamara Mchedlidze, and Alexey Pak. 2018. “Aesthetic Discrimination of Graph Layouts.” In Graph Drawing and Network Visualization, edited by Therese Biedl and Andreas Kerren, 169–84. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-04414-5_12.
Kleinberg, Jon. 2002. “The Structure of Information Networks.” https://www.cs.cornell.edu/courses/cs685/2002fa/.
Klimt, Bryan, and Yiming Yang. 2004. “Introducing the Enron Corpus.” In International Conference on Email and Anti-Spam.
Knuth, Donald Ervin. 2009. The Stanford Graphbase: A Platform for Combinatorial Computing. Addison-Wesley.
Kruiger, J. F., P. E. Rauber, R. M. Martins, A. Kerren, S. Kobourov, and A. C. Telea. 2017. “Graph Layouts by t-SNE.” Computer Graphics Forum 36 (3): 283–94. https://doi.org/10.1111/cgf.13187.
Leskovec, Jure, and Andrej Krevl. 2014. “SNAP Datasets: Stanford Large Network Dataset Collection.” http://snap.stanford.edu/data.
Letunic, Ivica, and Peer Bork. 2021. “Interactive Tree Of Life (iTOL) v5: An Online Tool for Phylogenetic Tree Display and Annotation.” Nucleic Acids Research 49 (W1): W293–96. https://doi.org/10.1093/nar/gkab301.
Maddison, David, Katja Schulz, and Wayne Maddison. 2007. “The Tree of Life Web Project*.” Zootaxa 1668 (December): 19–40. https://doi.org/10.11646/zootaxa.1668.1.4.
Mallach, Sven. 2019. “A Natural Quadratic Approach to the Generalized Graph Layering Problem.” In Lecture Notes in Computer Science, 532–44. Springer International Publishing. https://doi.org/10.1007/978-3-030-35802-0_40.
Malone, D. W. 1975. “An Introduction to the Application of Interpretive Structural Modeling.” Proceedings of the IEEE 63 (3): 397–404. https://doi.org/10.1109/PROC.1975.9765.
Meidiana, Amyra, Seok-Hee Hong, and Peter Eades. 2023. “Shape-Faithful Graph Drawings.” In Graph Drawing and Network Visualization, edited by Patrizio Angelini and Reinhard von Hanxleden, 93–108. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-22203-0_8.
Meidiana, Amyra, Seok-Hee Hong, Peter Eades, and Daniel Keim. 2019. “A Quality Metric for Visualization of Clusters in Graphs.” In Lecture Notes in Computer Science, 125–38. Springer International Publishing. https://doi.org/10.1007/978-3-030-35802-0_10.
Miller, Jacob, Vahan Huroyan, and Stephen Kobourov. 2023. “Spherical Graph Drawing by Multi-Dimensional Scaling.” In Graph Drawing and Network Visualization, edited by Patrizio Angelini and Reinhard von Hanxleden, 77–92. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-22203-0_7.
Muelder, Chris, and Kwan Liu Ma. 2008a. “Rapid Graph Layout Using Space Filling Curves.” IEEE Transactions on Visualization and Computer Graphics 14 (6): 1301–8. https://doi.org/10.1109/TVCG.2008.158.
Muelder, Chris, and Kwan-Liu Ma. 2008b. “A Treemap Based Method for Rapid Layout of Large Graphs.” In 2008 IEEE Pacific Visualization Symposium, 231–38. https://doi.org/10.1109/PACIFICVIS.2008.4475481.
Nachmanson, Lev, Arlind Nocaj, Sergey Bereg, Leishi Zhang, and Alexander Holroyd. 2017. “Node Overlap Removal by Growing a Tree.” Journal of Graph Algorithms and Applications 21 (5): 857–72. https://doi.org/10.7155/jgaa.00442.
Niedermann, Benjamin, and Ignaz Rutter. 2020. “An Integer-Linear Program for Bend-Minimization in Ortho-Radial Drawings.” In Lecture Notes in Computer Science, 235–49. Springer International Publishing. https://doi.org/10.1007/978-3-030-68766-3_19.
Noack, Andreas. 2004. “An Energy Model for Visual Graph Clustering.” In Graph Drawing, edited by Giuseppe Liotta, 425–36. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-24595-7_40.
Nocaj, Arlind, Mark Ortmann, and Ulrik Brandes. 2015. “Untangling the Hairballs of Multi-Centered, Small-World Online Social Media Networks.” Journal of Graph Algorithms and Applications : JGAA 19 (2): 595–618. https://doi.org/10.7155/jgaa.00370.
Ortmann, Mark, Mirza Klimenta, and Ulrik Brandes. 2017. “A Sparse Stress Model.” Journal of Graph Algorithms and Applications 21 (5): 791–821. https://doi.org/10.7155/jgaa.00440.
Raj, Mukund, and Ross T. Whitaker. 2018. “Anisotropic Radial Layout for Visualizing Centrality and Structure in Graphs.” In Lecture Notes in Computer Science, 351–64. Springer International Publishing. https://doi.org/10.1007/978-3-319-73915-1_28.
Rossi, Ryan A., and Nesreen K. Ahmed. 2015. “The Network Data Repository with Interactive Graph Analytics and Visualization.” In AAAI. http://networkrepository.com.
Rüegg, Ulf, Thorsten Ehlers, Miro Spönemann, and Reinhard von Hanxleden. 2016. “A Generalization of the Directed Graph Layering Problem.” In Lecture Notes in Computer Science, 196–208. Springer International Publishing. https://doi.org/10.1007/978-3-319-50106-2_16.
Subramanian, Arvind, and Shang-Jin Wei. 2007. “The WTO Promotes Trade, Strongly but Unevenly.” Journal of International Economics 72 (1): 151–75. https://doi.org/10.1016/j.jinteco.2006.07.007.
Sugiyama, Kozo, Shojiro Tagawa, and Mitsuhiko Toda. 1981. “Methods for Visual Understanding of Hierarchical System Structures.” IEEE Transactions on Systems, Man, and Cybernetics 11 (2): 109–25. https://doi.org/10.1109/tsmc.1981.4308636.
Tanahashi, Yuzuru, Chien-Hsin Hsueh, and Kwan-Liu Ma. 2015. “An Efficient Framework for Generating Storyline Visualizations from Streaming Data.” IEEE Transactions on Visualization and Computer Graphics 21 (6): 730–42. https://doi.org/10.1109/tvcg.2015.2392771.
Welch, E., and S. Kobourov. 2017. “Measuring Symmetry in Drawings of Graphs.” Computer Graphics Forum 36 (3): 341–51. https://doi.org/10.1111/cgf.13192.
Zarate, David Cheng, Pierre Le Bodic, Tim Dwyer, Graeme Gange, and Peter Stuckey. 2018. “Optimal Sankey Diagrams via Integer Programming.” In 2018 IEEE Pacific Visualization Symposium (PacificVis), 135–39. https://doi.org/10.1109/PacificVis.2018.00025.
Zheng, Jonathan X., Samraat Pawar, and Dan F. M. Goodman. 2019. “Graph Drawing by Stochastic Gradient Descent.” IEEE Transactions on Visualization and Computer Graphics 25 (9): 2738–48. https://doi.org/10.1109/TVCG.2018.2859997.
Zhong, Fahai, Mingliang Xue, Jian Zhang, Fan Zhang, Rui Ban, Oliver Deussen, and Yunhai Wang. 2023. “Force-Directed Graph Layouts Revisited: A New Force Based on the t-Distribution.” IEEE Transactions on Visualization and Computer Graphics, 1–14. https://doi.org/10.1109/TVCG.2023.3238821.
Zhu, Minfeng, Wei Chen, Yuanzhe Hu, Yuxuan Hou, Liangjun Liu, and Kaiyuan Zhang. 2021. “DRGraph: An Efficient Graph Layout Algorithm for Large-Scale Graphs by Dimensionality Reduction.” IEEE Transactions on Visualization and Computer Graphics 27 (2): 1666–76. https://doi.org/10.1109/TVCG.2020.3030447.